Structural biology
OpenProteinSet: Training data for structural biology at scale
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
SO(3)-invariant PCA with application to molecular data
Fraiman, Michael, Hoyos, Paulina, Bendory, Tamir, Kileel, Joe, Mickelin, Oscar, Sharon, Nir, Singer, Amit
ABSTRACT: Principal component analysis (PCA) is a fundamental technique for dimensionality reduction and denoising; however, its application to three-dimensional data with arbitrary orientations -- common in structural biology -- presents significant challenges. A naive approach requires augmenting the dataset with many rotated copies of each sample, incurring prohibitive computational costs. In this paper, we extend PCA to 3D volumetric datasets with unknown orientations by developing an efficient and principled framework for SO(3)-invariant PCA that implicitly accounts for all rotations without explicit data augmentation. By exploiting underlying algebraic structure, we demonstrate that the computation involves only the square root of the total number of covariance entries, resulting in a substantial reduction in complexity.
Index Terms: steerable PCA, group invariants, 3D volumes, cryo-EM, spherical Bessel, ball harmonics
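To make the cost of the naive baseline concrete, here is a toy sketch (my own example, not the authors' code) of rotation-augmented PCA on random volumes. It uses the four in-plane 90-degree rotations as a tiny, exact stand-in for sampling SO(3); a faithful sampling would need many more rotated copies, which is the blow-up an invariant formulation avoids:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vols, size = 5, 8
volumes = rng.normal(size=(n_vols, size, size, size))

def augment_with_rotations(vols):
    """Append rotated copies of every volume (naive, memory-hungry)."""
    out = []
    for v in vols:
        for k in range(4):                      # 0, 90, 180, 270 degrees
            out.append(np.rot90(v, k=k, axes=(0, 1)))
    return np.stack(out)

aug = augment_with_rotations(volumes)           # shape (n_vols * 4, 8, 8, 8)
X = aug.reshape(len(aug), -1)                   # flatten volumes for PCA
X = X - X.mean(axis=0)
# Ordinary PCA on the augmented set: both storage and covariance work grow
# linearly with every rotation added to the augmentation set.
cov = X.T @ X / len(X)
eigvals = np.linalg.eigvalsh(cov)               # ascending order
```

Even in this toy case the sample count quadruples; with a dense sampling of SO(3) the augmentation factor, and hence the cost, becomes prohibitive, as the abstract notes.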
SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection
Adethya, Gokul, Mantha, Bhanu Pratyush, Wang, Tianyang, Li, Xingjian, Xu, Min
Cryo-electron tomography (cryo-ET) has emerged as a powerful technique for imaging macromolecular complexes in their near-native states. However, the localization of 3D particles in cellular environments still presents a significant challenge due to low signal-to-noise ratios and missing wedge artifacts. Deep learning approaches have shown great potential, but they require large amounts of training data, which is a challenge in cryo-ET scenarios where labeled data is often scarce. In this paper, we propose a novel Self-augmented and Self-interpreted (SaSi) deep learning approach for few-shot particle detection in 3D cryo-ET images. Our method builds upon self-augmentation techniques to boost data utilization and introduces a self-interpreted segmentation strategy that alleviates dependency on labeled data, improving generalization and robustness. As demonstrated by experiments conducted on both simulated and real-world cryo-ET datasets, the SaSi approach significantly outperforms existing state-of-the-art methods for particle localization. This research advances understanding of how to detect particles with very few labels in cryo-ET and sets a new benchmark for few-shot learning in structural biology.
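To illustrate the kind of label-free augmentation alluded to above, here is a hedged sketch (a generic technique, not the authors' SaSi code) of randomly flipping and rotating a 3D subtomogram by axis-aligned 90-degree turns, a common way to stretch scarce cryo-ET training data without touching the labels:

```python
import numpy as np

def augment_subtomogram(vol, rng):
    """Apply a random flip and a random 90-degree rotation to a 3D volume."""
    if rng.random() < 0.5:
        vol = np.flip(vol, axis=rng.integers(0, 3))
    k = rng.integers(0, 4)                       # number of quarter turns
    rot_axes = [(0, 1), (0, 2), (1, 2)][rng.integers(0, 3)]
    return np.rot90(vol, k=k, axes=rot_axes)

rng = np.random.default_rng(42)
subtomo = rng.normal(size=(16, 16, 16))          # toy noisy subtomogram
augmented = augment_subtomogram(subtomo, rng)
print(augmented.shape)                           # shape is preserved
```

Because the transforms only permute voxels, the augmented volume keeps the same shape and the same voxel values as the original, just rearranged.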
"ML-Everything"? Balancing Quantity and Quality in Machine Learning Methods for Science
Recent research in machine learning (ML) has led to significant progress in many fields, including scientific applications. However, several limitations must be addressed to ensure the validity of new models, the quality of testing and validation procedures, and the actual applicability of the developed models to real-world problems. These limitations include evaluations that are unfair, subjective, or unbalanced (not necessarily intentionally so), the use of datasets that do not properly reflect real-world use cases (for example, because they are "too easy"), and incorrect ways of splitting datasets into training, validation, and test subsets. In this article I discuss these points, using examples from the domain of biology, which is being revolutionized by ML methodologies. Along the way I also briefly touch on the interpretability of ML models, which today is very limited but very important, because it could help clarify many of the limitations discussed in the first part of the article.
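One of the splitting pitfalls mentioned above is leakage between near-duplicate samples, e.g. homologous proteins landing on both sides of a train/test split. Here is an illustrative sketch (my own example, not from the article) of a group-aware split that assigns whole groups, such as protein families, to one side or the other:

```python
import random

def group_split(samples, groups, test_frac=0.3, seed=0):
    """Assign entire groups to train or test so related samples never mix."""
    uniq = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(uniq)
    n_test = max(1, int(len(uniq) * test_frac))
    test_groups = set(uniq[:n_test])
    train = [s for s, g in zip(samples, groups) if g not in test_groups]
    test = [s for s, g in zip(samples, groups) if g in test_groups]
    return train, test

seqs = ["s1", "s2", "s3", "s4", "s5", "s6"]
fams = ["famA", "famA", "famB", "famB", "famC", "famC"]
train, test = group_split(seqs, fams)
# No family appears on both sides of the split.
```

A plain random split over `seqs` would routinely place two members of the same family in train and test, inflating reported accuracy; the group-level split removes that optimistic bias.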
How AlphaFold can realize AI's full potential in structural biology
Tomorrow's AI applications will not happen without research being shared openly in repositories such as that maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute near Cambridge, UK. "I wake up and type AlphaFold into Twitter," says John Jumper, who was talking to Nature in April for a News Feature on how software that can predict the 3D shape of proteins from their genetic sequence is changing biology (Nature 604, 234-238; 2022). Jumper leads the team at London-based company DeepMind that developed the AlphaFold software. Last week, DeepMind, part of the Google family, announced that its researchers have used AlphaFold to predict the structures of 214 million proteins from more than one million species -- essentially all known protein-coding sequences. AlphaFold is clearly one of the most exciting developments to hit the life sciences in recent decades.
Whither structural biologists?
Between December 2020 and July 2021, several spectacular developments in the field of protein-structure prediction changed structural biology profoundly, and they are expected to have an impact on much of modern (molecular) biology, medicine, biochemistry and biotechnology. The unprecedented accuracy of blind protein-structure predictions produced by DeepMind's AlphaFold2 was revealed at the CASP 14 meeting in December 2020. In July 2021, this was followed by publication of the method and release of the code (Jumper et al., 2021). Simultaneously, a prediction method from the Baker lab that achieved similar accuracy was published (Baek et al., 2021). A week later, an additional publication described proteome-scale application of protein-structure prediction using AlphaFold2.
In Its Greatest Biology Feat Yet, AI Unlocks the Complex Proteins Guarding Our DNA
AI has done it again. After solving one of the grandest mysteries in biology -- predicting protein structure -- it decoded how proteins link up into complexes and dreamed up novel protein structures that may ultimately be turned into drugs to control our basic biology, health, and life. Yet when faced with enormous protein complexes, AI had faltered. Now, in a mind-bending feat, a new algorithm has deciphered the structure at the heart of inheritance: a massive complex of roughly 1,000 proteins that helps channel DNA instructions to the rest of the cell. The model is built on AlphaFold from DeepMind and RoseTTAFold from Dr. David Baker's lab at the University of Washington, both of which were released to the public for further experimentation.